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We consider a regularized least squares problem, with regularization by structured sparsity-inducing norms, 
which extend the usual l\ and the group lasso penalty, by allowing the subsets to overlap. Such regularizations 
lead to nonsmooth problems that are difficult to optimize, and we propose in this paper a suitable version of an 
accelerated proximal method to solve them. We prove convergence of a nested procedure, obtained composing 
an accelerated proximal method with an inner algorithm for computing the proximity operator. By exploiting the 
geometrical properties of the penalty, we devise a new active set strategy, thanks to which the inner iteration is 
relatively fast, thus guaranteeing good computational performances of the overall algorithm. Our approach allows 
to deal with high dimensional problems without pre-processing for dimensionality reduction, leading to better 
i_i computational and prediction performances with respect to the state-of-the art methods, as shown empirically 

both on toy and real data. 
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^ 1 Introduction 

Sparsity has become a popular way to deal with a number of problems arising in signal and image processing, 
statistics and machine learning fl8l . In a broad sense, it refers to the possibility of writing the solution in terms 
of a few building blocks. Often sparsity based methods are the key towards finding interpretable models in real- 
world problems. For example, sparse regularization based with £i-type penalties is a powerful approach to find 
sparse solutions by minimizing a convex functional Il47l ITT1 IT7I . The success of l\ regularization motivated ex- 
ploring different kinds of sparsity properties for regularized optimization problems, exploiting available a priori 
information, which restricts the admissible sparsity patterns of the solution. An example of a sparsity pattern is 
when the variables are partitioned into groups (known a priori), and the goal is to estimate a sparse model where 
variables belonging to the same group are either jointly selected or discarded. This problem can be solved by 
regularizing with the group l\ penalty, also known as group lasso penalty |51|. The latter is the sum, over the 
groups, of the euclidean norms of the coefficients restricted to each group. Note that, for any p > 1, the same 
groupwise selection can be achieved by regularizing with the l\l i v norm, i.e. the sum over the groups of the £ p 
norm of the coefficients restricted to each group. A possible generalization of the group lasso penalty is obtained 
considering groups of variables which can be potentially overlapping [52, 23J, and the goal is to estimate a model 
which support is the union of groups. For example, this is a common situation in bioinformatics (especially in 
the context of high-throughput data such as gene expression and mass spectrometry data), where problems are 
characterized by a very low number of samples with several thousands of variables. In fact, when the number of 
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samples is not sufficient to guarantee accurate model estimation, a possible solution is to take advantage of the 
huge amount of prior knowledge encoded in online databases such as the Gene Ontology Ifl4l . Largely motivated 
by applications in bioinformatics, the latent group lasso with overlap penalty is proposed in [21] and further studied 
in I35l l2l and in 11371 in the image processing context, which generalizes the l\l li penalty to overlapping groups, 
thus satisfying the assumption that the admissible sparsity patterns must be unions of a subset of the groups. 

All the methods proposed in the literature solve the minimization problem arising in [21 J by applying state- 
of-the-art techniques for group lasso in an expanded space, called space of latent variables, built by duplicating 
variables that belong to more than one group. The most popular optimization strategies that have been proposed 
are interior-points methods [3. 36], block coordinate descent 1271 . proximal methods 0321 [30j |37j [23 H2 j and the re- 
lated alternating direction method [15]. Very recently, the paper [39 1 proposed an accelerated alternating direction 
method and [40 1 studied a block coordinate descent, along with a proximal method with variable step-sizes. 

As already noted in Gil , though very natural, every implementation developed in the latent variables does 
not scale to large datasets: when the groups have significant overlap, a more scalable algorithm with no data 
duplication is needed. For this reason we propose an alternative optimization approach to solve the group lasso 
problem with overlap, and extend it to the entire family of group lasso with overlap penalties, that generalize the 
£i/£p penalties to overlapping groups for p > 1. Our method is a two-loops iterative scheme based on proximal 
methods (see for example [32, 6 5|), and more precisely on the accelerated version named FISTA f5|. It does not 
require explicit replication of the variables and is thus more appropriate to deal with high dimensional problems 
with large group overlap. In fact, the proximity operator can be efficiently computed by exploiting the geometrical 
properties of the penalty. We show that such an operator can be written as the identity minus the projection onto a 
suitable convex set, which is the intersection of as many convex sets as the number of active groups, that is groups 
corresponding to active constraints, which can be easily found. Indeed, the identification of the active groups is a 
key step, since it allows computing the projection in a reduced space. For general p, the projection can be solved 
via the Cyclic Projections algorithm [4] . Furthermore, for the case p = 2, we present an accelerated scheme, where 
the reduced projection is computed by solving a corresponding dual problem via the projected Newton method 
[7], thus working in a much lower dimensional space. 

The present paper completes and extends the preliminary results presented in the short conference version 
[31 1 . In particular, it contains a general mathematical presentation and all the proofs, which were omitted in [31 1. 
We next describe how the rest of the paper is organized, and then highlight the main novelties with respect to 
the short version. In Section |2j we cast the problem of Group-wise Selection with Overlap (GSO) as a regulariza- 
tion problem based on a modified €i/^ p -type penalty and compare it with other structured sparsity penalties. We 
extend the approach in [31 J for p = 2 to general p > 1. In Section [3| we describe the derivation of the proposed 



optimization scheme, and prove its convergence. Precisely, we first recall proximal methods in Subsection 3.1 then 



in Subsection 3.2 we describe the technical results that ease the computation of the proximity operator as a simpli- 
fied projection, and present different projection algorithms depending on p. With respect to [31], we show that our 
active set strategy can be profitably used in this generalized framework in combination with any algorithm chosen 
to compute the inner projection. Furthermore, to solve the projection for a general p e (1, +oo], we discuss the use 
of a cyclic projections algorithm, whose convergence in norm is guaranteed and results in a rate of convergence for 
the proposed proximal method, proved in Subsection |3.3| Section [4] is a substantial extension of the experiments 
performed in [31 ]. We empirically analyze the computational performance of our optimization procedure. We first 
study the performance of the different variations of the proposed optimization scheme. Then we present a set of 
numerical experiments comparing running time of our algorithm with state-of-the-art techniques. We conclude 
with a real data experiment where we show that the improved computational performance allows dealing with 
large data sets without preprocessing thus improving also the prediction and selection performance. Finally, in 
Appendix B we review the projected Newton method 171 . 

Notation. Given a vector x € R d , we denote with ||-|| the i?p-norm of x, defined as = (Xw=i x^) 1 ^ and 

ll x lloo = max je{i,—d} \ x j\- We will also use the notation ||x|| G = EjgG^j) 1 ^ f° r P — 1' II x IIgoo = 
maxjgG \xj\ to denote the ^ p -norm of the components of x in G c {1, . . . , d}. When the subscript p is omitted, the 
£ 2 norm is used, 1 1 • | = 1 1 ■ 1 1 2 . The conjugate exponent of p is denoted by q; we recall that q is such that 1/p + l/q = 1. 
In the following, X will denote R d and Y a bounded interval in M. 
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2 Group-wise selection with Overlap (GSO) 



This paper proposes an optimization algorithm for a regularized least-squares problem of the type 

min £P(x), SE(x) = - life - yf + 2tQ%(x) , (GSO-p) 

xeM d n v 

where ^ : W l — > K n is a linear operator, y £ M. n , and f2^ : R d — > [0, +00) is a convex and lower semicontinuous 
penalty, depending on a parameter p £ (1, +00], and on an a priori given group structure, Q = {G r }f =1 , with 
G r C { 1 , . . . , d} and G r = {1, . . . , d}. Note that other data fit terms could be used, different from the quadratic 
one, as long as they are convex and continuously differentiable with Lipschitz continuous gradient. We will focus 
on least squares to simplify the exposition. Most group sparsity penalties can be built starting from the family of 
canonical linear projections on the subspace identified by the indices belonging to G r , i.e. P r : M. d —> M. r . The 
definition of the penalties we consider is based on the adjoint of the linear operator 

B 

PiE^J]!^, Px=(P 1 x,...,P B x), 

r=l 

that is the operator 

B B 

P* : Y[R G " -^R d , P*( Vl ,...,v B ) = J2 P >r, 

r—l r—1 

where P* : M. Gr —> M. d is the canonical injection. For x £ M. d we set 

B 

Q s p (x)= min £lKHp- (1) 

P* V — X 

For p = 2, the functional Slf was introduced in [21 1 (see also ||35l 13). The distinctive feature of the family of 
penalties flp, is that they have the property of inducing group-wise selection, that is they lead to solutions with 
support (i.e. set of non zero entries) which is the union of a subsets of the groups defined a priori. In fact, can 
be seen as a generalization of the mixed t\/£ p norms, originally introduced for disjoint groups: 

B 

R s p (x) = Y,\\x\\ Gr , P , P>1- 

r=l 

For p = 2, is the group lasso penalty, and it is well-known [51 J that such penalties lead to solutions whose 
support is the union of a small number of groups. The penalty can be written also if the groups overlap, and 
more generally the composite absolute penalties (CAP) 

B 
r=l 

first introduced in Il52l and coinciding with R^J for 7 = 1, have been intensively studied. The J 7iP penalties allow to 
deal with complex groups structures involving hierarchies or graphs and it is proved in [23 1 that the CAP penalties 
constraint the support to be the complement of a union of groups, fij and R^ are thus somehow complementary 
and have different domain of applications Il23l l26l 24 1 . 

While many algorithms have been proposed to solve the optimization problem corresponding to R^, the one 
corresponding f2= is much less studied. This is due on the one hand to the fact that the penalty is more complex, 
and on the other hand to the widespread use of the "replication strategy". The latter is based on the observation 
that, using the definition of 0= , and the surjectivity of P* , the (GSO-p) minimization problem can be written as 

1 B 
min -||*P* u _ 2/ || 2 + 2rE||« r || (2) 
ten 1 R G - 71 1 

1 if — 1 r—1 
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which is a group lasso problem without overlap for the linear operator tyP* in the so called latent variables (v r )^ =1 , 
obtained by replicating variables belonging to more than one group. The last rewriting allows to apply every 
algorithm developed for the standard group-lasso to the overlapping case, but this strategy is not feasible for high 
dimensional problems with large group overlaps, as potentially many artificial dimensions are created. The main 
goal of this paper is to propose and study an optimization algorithm which does not require the replication of 
variables belonging to more than one group. 

The choice p > 1 has both technical and practical motivations. On the one hand, it guarantees convexity of the 
penalty - which can be shown to be a norm (see Lemma 1 in [21 1 for p = 2) -, and, as a consequence, of thel lGSOj) 
regularization problem (note that this is valid for p = 1 too). On the other hand, it enforces "democracy" among 
the elements that belong to the same group, in the sense that no intragroup sparsity is enforced, thus inducing 
group-wise selection. The case p = 1 is trivial, since the penalty flf coincides with the l\ norm, or lasso penalty 
El: 

and is thus independent of Q. 



Example 1. A particular instance of the above problem occurs in statistical learning. Assume that the estimator and 
the regression function can be described by a generalized linear model f(x) = X^=i x j ( x )> ror a given dictionary 
{^j}j=i of functions ■ : X -> Y (with X a set and Y C M). Given a training set {{xi,yi)^ =1 } € (X x Y) n the 
regularized empirical risk takes the form 



-||*x-y|| 2 + 2r^( a: ) ; 



with * : R d -> R n , = Ei=i ^j{xi)xj and y = (y 1} . . . , y n ). 

Example 2. Most results obtained in the paper hold in an infinite dimensional setting. In particular, our approach 
can be naturally extended to the multiple kernel learning(MKL) problem [28 1. For this problem, given reproducing 
kernel Hilbert spaces Hi, ■ . ■ ,H m of functions g : X — > M, defining T~L = Y^Li^-r, the resulting optimization 
problem takes the form (see 1281 ) 



mm 



r=l 



for a suitable * : "H — > W 1 , y € W 1 . As can be readily seen, the multiple kernel learning problem has the same 
structure of the I GSO-p) problem described above. 



3 An efficient proximal algorithm 



Due to non-smoothness of the penalty term, solving the I GSO-p I minimization problem is not trivial. Moreover, 



if one needs to solve the I GSO-p) problem for high dimensional data, the use of standard second-order methods 



such as interior-point methods is precluded (see for instance [6]), since they need to solve large systems of linear 
equations to compute the Newton steps. On the other hand, first order methods inspired to Nesterov's seminal 
paper [33] (see also Il32l ) and based on the proximal map are accurate, and robust, in the sense that their perfor- 
mance does not depend on the fine tuning of various controlling parameters. Furthermore, these methods were 
already proved to be a computationally efficient alternative for solving many regularized inverse problems in 
image processing [10J, compressed sensing [6] and machine learning applications ll2l [T6ll30l . 
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3.1 Proximal methods 



The (GSO-p) regularized convex functional is the sum of a convex smooth term, F(x) = — \\^>x — y\\ 2 , with Lip- 
schitz continuous gradient, and a non-differentiable penalty rfi=(-). A minimizing sequence can be computed 
with a proximal gradient algorithm [48 J (a.k.a. forward-backward splitting method [13], and Iterative Shrinkage 
Thresholding Algorithm (ISTA) 0) 

x m = prox xns (x" 1 - 1 - ^-VFix" 1 - 1 )) (ISTA) 

' P \ 2(7 J 

for a suitable choice of a, and any initialization x°. Recently, several accelerations of ISTA have been proposed 
l34l HHl With respect to ISTA, they only require the additional computation of a linear combination of two 
consecutive iterates. Among them, FISTA (Fast Iterative Shrinkage Thresholding Algorithm) [5| is given by the 
following updating rule for m > 1 



x m =prox xfig ( h m - 1 V/ 



-i = \ (l + \/i+4^) (FISTA) 



for a suitable choice of a > 0, s\ = 1, and any initialization h 1 = x°. Both schemes are based on the computation 
of the proximity operator [29], which is defined as 

prox Anc (2) = axgmin$A(x), with $ A (x) = — ||ac - z|| 2 + fl^(aj), A > 0. (3) 



Algorithm 1 FISTA for GSO-p 



Given: G, pE (1, +oo], r > 0, e > 0, a > 0, x° = ft e M d , s = 1, 
Let: cr = ||* T *||/n, m = and g such that ~ + ~ = 1. 
while convergence not reached do 



±-V T (Vh m - y) 



Find g m = {GeG, 



> 1} 



Approximately compute the projection of h m onto ^Kp™ '■= DggG" 
ance e m~ a 



he 



\h\ 



G,q 



< 



with toler- 



Wl = \ (l + V 1 + 4^4) 



end while 
return a;™ 



The convergence rate of £^(x m ) — min£^, for ISTA and FISTA, is 0(l/m) and 0(l/m 2 ), respectively, when the 
proximity operator is computed exactly. However, in general, the exact expression is not available. Recently, it 
has been shown that, also in the presence of errors, the accelerated version maintains advantages with respect to 
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the basic one. In fact, the rate 0(1/ m 2 ) for FISTA in the presence of computational errors was recently proved in 
1 45 ,50 1 for various error criteria. Convergence of ISTA with errors was already known, and first proved in 1HT1IT31 . 
Since the proximity operator of the penalty Qp 1 is not admissible in closed for m, the l|GSO-p) minimization prob- 



or 



FISTA where VF(h m ) is simply 



lem can thus be solved via an inexact version of the iterative schemes |ISTA 

2^ T (^>h m — y) /n. Note that, in the special case of not overlapping groups, the proximity operator can be explicitly 
evaluated group-wise, and reduces to a group-wise soft-thresholding operator. In the general case, as explained 
in Subsection 3.2 the proximity operator can be written in terms of a projection, and we will provide an algorithm 
to approximately compute it. Note also that we will show that at each step the projection involves only a subset 
of the initial groups, the active groups, thus significantly increasing the computational performance of the overall 
algorithm. 



3.2 Computing the proximity operator of Sl Q p 

In this subsection we state the lemmas that allow us to efficiently compute the proximity operator of Jl^ and to 
formulate the inexact version of FISTA reported in Algorithm [T] 

As a direct consequence of standard results of convex analysis, Lemma [T] shows that the computation of the 
proximity operator amounts to the computation of a projection operator onto the intersection of convex sets, each 
of them corresponding to a group. In Lemma |2j we theoretically justifies an active set strategy, by showing that 
when projecting a vector onto this intersection, it is possible to discard the constraints which are already satisfied. 

Lemma 1. For any A > and p > 1, the proximity operator of XQp 1 , where Jl^ is defined in Q, is given by 

P rox \ns = i -k\ks- 



where tt\ks denotes the projection onto XKp, and Kp is given by 



{a; e R d ,||a:|| Grifl < l,forr = l,...,B}. 



The proof exploits the particular definition of the penalty and relies on the Moreau decomposition 



prox XQ (x) = x- Aprox^ 



(4) 



(5) 



Formula Q allows to compute the proximity operator of Jl starting from the proximity operator of the Fenchel 
conjugate. In our case, being £7^ one homogeneous, we obtain the identity minus the projection onto a closed and 
convex set. The particular geometry of , which is the intersection of B convex generalized cylinders "centered" 
on a coordinate subspace, derives from definition of Hp and the explicit computation of its Fenchel conjugate. Ob- 
serve that by definition £1^ is the infimal convolution of B functions, and precisely the B norms on R Gr composed 
with the projections. By standard properties of the Fenchel conjugate, it follows that (ftp)* — i q , where i q is the 
dual function of ||-|L, i.e. the indicator function of the i q unitary ball in R Gr . We give here a self-contained proof 
which does not use the notion of infimal convolution. A different proof for the case p = 2 is given in l35l . 

Proof. We start by computing explicitly the Fenchel conjugate of £1^ . By definition, 



sup 

xGR d 



sup 



[X, u) — miu 



P* v=x 



sup 

x£l d 



B 



(£ P >r,u)-J2\K\\ p 



£IML 

r=l 

= Y1 SUP ( P r V r,U) - \\v r \\p 



sup {x, u) 
«e]lK G '' j 

, P* v—x 



V sup (v r ,P r u) -\v r \ J=Vt,(P r 1i), 



r=l 
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where i q is the Fenchel conjugate of ||-|| , i.e. the indicator function of the l q unitary ball in E Gr . We can rewrite 
the sum of indicator functions as X^=i L q{Pr u ) — l k s ( u )- It is well-known that 

Using the Moreau decomposition |5]l and basic properties of the projection we obtain 

prox An (a;) = x — Xn K g (x/X) = x — tt\kg ( x )- (6) 

□ 

The following lemma shows that, when evaluating the projection n K g (x), we can restrict ourselves to a subset 

of active groups, denoted by Q = Q (x) and defined in Lemma|2] This equivalence is crucial to speed up Algorithm 
[TJ in fact the number of active groups at iteration m will converge to the number of selected groups, which is 
typically small if one is interested in sparse solutions. 

Lemma 2. Given x € K d , it holds 

K\K6 (X) = 7T XK g (x) , (7) 

where G:= \g £ Q, \\x\\ G q > a} . 

Proof. Given a group of indices G and a number p > 1, we denote by Cq. v the convex set 

C G , p = {xeR d : ||z|| G)9 < 1}. 

To prove the result we first show that for any subset S CQ the projection onto the intersection \Kf } = C\geS^C*G,p 
is non-expansive coordinate-wise with respect to zero. More precisely, for all x e R d , it holds that \ttxk s ( x )i\ < \ x i\ 
for all i = 1, . . . , d and for all A > 0. By contradiction, assume that there exists an index j such that \tt X k s ( x )j\ > \ x ]\- 
Consider the vector x defined by setting 

\-KXKs{x)j if j^j 



Xj otherwise. 



First note that x e XK^, since ||x|| G < 



< A for all G £ S. On the other hand 

G,q 



d 



\\ x ~ x f =^2( X J ~ x i) 2 < x ~^\ks{ x ) 

which is a contradiction. To conclude, suppose that x £ XK^, with S C Q. If we prove that 

^XKg(x) = ir XK g\s(x), 

we are done. For the sake of brevity denote v = n XK g\s(x). Thanks to the non-expansive property it follows 

\vj\ < \xj\ for all j = 1, . . . ,d and therefore v £ XK^. Since v £ XKp^ s by hypotheses, we get that v £ XK&. 
Furthermore by definition of projection 

\\v — x\\ < \\w — x\\ , for every w £ XK^ S 

and a fortiori \\v — x\\ < \\w — x\\ for every w £ XK^. □ 
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3.2.1 The projection on for general p 

The convex set Kjj is an intersection of convex sets, precisely 

Geo 

where C GiP = {v €R d ,\\v\\ Gtq < 1}. 

For general p a possible minimization scheme for computing the projection in ||7) can be obtained by applying 
the Cyclic Projections algorithm [8J or one of its modified versions (see |4] and references therein). In the particular 
case of p = 2, we describe the Lagrangian dual problem corresponding to the projection onto iff, and we propose 
an alternative optimization scheme, the projected Newton method [7J, which better exploits the geometry of the 
set A'f, and in practice proves to be faster than the Cyclic Projections algorithm. Note that, in order to satisfy the 
hypothesis of Theorem 14 the tolerance for stopping the iteration must decrease with the outer iteration m. 



A simple way to compute the projection onto the intersection of convex sets is given by the Cyclic Projections 
algorithm JSJ, which amounts to cyclically projecting onto each set. Here we recall in Algorithm [2] a modification 
of the Cyclic Projections algorithm proposed by [4|, for which strong convergence is guaranteed (see Theorem 4.1 
in H). 



Algorithm 2 Cyclic Projections 



Given i£l' l ,{C Gl , p ,... I C GB!P } 
Let 1 = 0, w° = x and find CU .....C* „ 
while convergence not reached do 
1 = 1 + 1 

Let iv i the projection onto tCa 



/ 1 ' / /— 1 \ 

W = - X+- 7T;(w ) 

l + l l + l U ' 



end while 



In the following we describe how to compute each projection nc p r for specific values of p. 
p = 2. In this case q = 2, and the projection is trivial 

Kc G , 2 0)] 3 = 



r wfc if j G Gand ||u;|| Gi2 > r 
w, otherwise 



p = oo. In this case q = 1, and Cg,oo is an i\ ball when restricting to the coordinates in G. From Lemma 4.2 in 1 19 1, 
we have that if || w\\ 1 > t, then the projection of w onto the l\ ball of radius r, tB\, is given by the soft-thresholding 
operation 

[^rB 1 {w)] j = {\Wj\ - ) u)+sign(w i ) 

where [i (depending on w and r) is chosen such that J2j(\ w j \ ~ m)+ = T - 

We recall a simple procedure provided in |19| for determining [i. In a first step, sort the absolute values of the 
components of w, resulting in the rearranged sequence, w* > w* +1 > for all j. Next, perform a search to find k 
such that 

fc-l k 

Ek-^<^<Eh- w w)- 

J = l 7=1 

Then set y. = w* k + fc" 1 (E^i 1 !^* - w fc) - T ) 
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p ^ 2, +00. In these cases no known closed form for the projection on the set Ca r:P exist, but it can be efficiently 
computed using Newton's method, as done in E2l . 

3.2.2 The projection on for p = 2 

When p = 2, the projection onto iff amounts to solving the constrained minimization problem 

Minimize ||u — a;|| 2 

subjectto v e R d , \\v\\ G2 < r, for G e G, ^' 

which Lagrangian dual problem can be written in a closed form. Working on the dual is advantageous, since the 
number of groups is typically much smaller than d, and furthermore Lemma [2] guarantees that one can restrict to 
the subset of groups 

G:={GeG: \\x\\ G2 >r}=:{G ll ... 1 G^} (9) 

which in general is a proper subset of G- 

In the following theorem we show how to compute the solution to problem pb, by solving the associated dual 
problem. 

Theorem 1. Given x e R d , G — with G r C {1, . . . , d} r G as in |9) and r > 0, the projection ofx onto the convex 

set tK^ with K% — {v G R d : \\v\\ G 2 < r for r = 1, . . . , B} is given by 



n TK? (x) , = ^ forj = l,...,d (10) 



1 + Er=i Ki 



where A* is the solution of 



-xl 



argmax/(A), with /(A) = E ^ E A ^ ( n ) 

AeK+ j=l 1 + E r= i Irj^r r=l 

and l r j e^i/flZ fo 1 ifj belongs to group G r and otherwise. 

Proof The Lagrangian function for the minimization problem |8} is defined as 

B 



L(V,\) = \\ V - X f+J2\r(\\v\\ Gr -T 2 



= E 



(«,• - Xjf + ^ \ r l r<j 



B 

2 



E A - 7 



, r 

r=l 

Ea+E 1 ^) k ^ — ) - E — ? E A * 



T 2 + || a; || 2 



(12) 



where A e M B . The dual function is then 



x 2 



/(A)= MUv,\) = L\ p ,A] = -£ ^ E^' + IWI 2 - 

Since strong duality holds, the minimum of (4) is equal to maximum of the dual problem which is therefore 

Maximize /(A) 

subjectto A r > Oforr = 1,...,B. { ' 
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Once the solution A* to the dual problem 1 13 1 is obtained, the solution to the primal problem d8), v*, is given by 



i+e;=iA?i 



for j = l,...,d. 



r-J 



□ 



The dual problem can be efficiently solved, for instance, via Bertsekas' projected Newton method described in 
0, and here reported as Algorithm [5] in the Appendix, where the first and second partial derivatives of /(A) are 
given by 

drf(X)=t 1'"' ^ 

fe(l + Ef=i l-.Ar 

and 



drd s f(X) 



E 



2r 2 1 -1 

3 r ->3 s iJ 



if G r nG» = 



; (1 + Ef=i l s jA s )- 3 otherwise. 



i 2 Ej G G,.nG s 

Bertsekas' iterative scheme combines the basic simplicity of the steepest descent iteration [43] with the quadratic 
convergence of the projected Newton's method |9 1. It does not involve the solution of a quadratic program thereby 
avoiding the associated computational overhead. Its convergence properties have been studied in [7J and are 
briefly mentioned in next section. 



3.3 Convergence analysis of GSO-p Algorithm 

In this subsection we clarify the accuracy in the computation of the projection which is required to prove con- 
vergence of the Algorithm [T] As mentioned above, we rely on recent theorems providing a convergence rate for 
proximal gradient methods with approximations. 

Definition 1. We say that w is an approximation of n T / aK g (x) with tolerance e if \\w — -K T / crK g (x)\\ < e. 

Theorem 2. Given x° e R d , and a = \ |vI' T vl/| \/n. Assume that TT T / aK g (x m ) in Algorithm's approximately computed at 
step m with tolerance e m = e a /m a . 

• If a > 2, there exists a constant Cj := Cj{p, Q, xq, a, r, a) such that the iterative update < |ISTA) satisfies 



£1 



Cj 



(14) 



If a > 4, there exists a constant Cp ■= Cf(p, G,xo,ct,t, a) such that the iterative update < |FISTA) satisfies 

e*{x m )-e*{x*)<% 



(15) 



Proof. It is enough to show that there exists a constant C > (independent of w l and x m ) such that 



w l - TT TK g(x m ) 



< 



c 



min $ x + e.„ 



(16) 



where $ x is defined as in |3|. Then the statement directly follows from Proposition 1 and Proposition 2 in j45l . In 
order to prove equation ( [16) first note that thanks to the assumption U^ =1 G, = {!,... ,d} made at the beginning, 
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it easily follows from the definition that is a norm on R d , and therefore it is equivalent to the euclidean one. 
Thus, there exists a constant A (depending only on p and Q) such that 

Sl%(x)-n%(x') < A\\x-x'\\ , Vx,x' G R d . 

Next, let w and x be such that 

w - ir T / aK e(x)^ < 7, (17) 

for some 7 > (and suppose w.l.o.g. that 7 < 1). By LemmalTland by definition of prox xf2S (m) and $x (see 
equation (|3J) we have 

<&.!. (x — Tr T / crK g (x)) — min $z. . 
Thus, by equation (17) , and using the fact that is a norm 



= -l 




2r 1 




<^ 


w 


- 2r 





- w) = — ||w|| + n^(x-tu) 



a 
2t 



■K T , aK g(x) +-{W- TT T / aK g(x),7T T / aK g(x)} 



+ Sl%(x - n T/aK g (x j) + (n T/aK g (x) - w) 



< mm <f>x H 7 + -7A + 

2t t 



= min $ x 
< min $ x 



C7 



2t i + - a + a )i 



where A is such that sup„ eifS ||u|| < A and C = C(p, Q, a, r). Therefore, equation ( |16) holds with C as defined 
above. □ 

As it happens for the exact accelerations of the basic forward-backward splitting algorithm such as l33l 13, 
convergence of the sequence x m is no longer guaranteed unless strong convexity is assumed. 

By Theorem 3.1 in 0J, Algorithm [2] is strongly convergent, and therefore, given arbitrary e > and x e R d , 
there exists an index l m :— l rn {e) such that w lm produced through Algorithm [2] enjoys the property 



W l - TT TK (>{x m ) 



for every I > l m . 

Algorithm [l] combined with Algorithm [2] thus converges to the minimum of l |GSO-p) problem with rate 1/m 2 , 
if the projection is approximately computed with tolerance eo / m " with a > 4. Similarly, one can use ISTA instead 
of FISTA as updating rule in Algorithm [l] obtaining the convergence rate 1/m, and setting a > 2. It is clear that 
the choice of a defines the stopping rule for the internal algorithm (see Subsection |4T). 



Every other algorithm producing admissible approximations can be used in place of Algorithm [2] in the com- 
putation of the projection. In the case p = 2, we tested Bertsekas' projected Newton method, reported in the 
Appendix as Algorithm [5] Its convergence is not always guaranteed, since there are particular choices of x and Q 
for which the partial Hessian of the dual function is not strictly positive defined, as would be required to ensure 
strong convergence (see Proposition 3 and Proposition 4 in |7J). However, ideas which are useful for circumvent- 
ing the same problem for unconstrained Newton's method, such as preconditioning, could be easily adapted to 
this case, and convergence has always been observed in our experiments (for more details see the discussion in 1 7 1 
and also the comments at the end of the next subsection). 
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3.4 Computing the regularization path 

In Algorithm [3] we report the complete scheme for computing the regularization path for the Group-wise Selection 
with Overlap problem [ GSO-p[ l, i.e. the set of solutions corresponding to different values of the regularization 



parameter n > . . . > tt- Note that we employ the continuation strategy proposed in GDI. When computing the 

Algorithm 3 Regularization path for GSO-p 
Given: n > r 2 > • ■ ■ > tt, G,e > 0,v > 
Let: a = ||* T #||/n, x^ = 
fort = l,...,Tdo 
Initialize: x = x^- 1 ' 
while convergence not reached do 

• update x according to Algorithm [lj with the projection computed via Cyclic Projections or by solving the 
dual problem 
end while 
x (n) = x 
end for 

return x^ Tl \ . . . , x {tt) 



proximity operator with Bertsekas' projected Newton method, a similar warm starting is applied to the inner 
iteration, since the m-th projection is initialized with the solution of the (to — l)-th projection. Despite the local 
nature of Bertsekas' scheme, such an initialization empirically proved to guarantee convergence. 

3.5 The replicates formulation 



As discussed in Section |2j the most common method to solve I GSO-p) problem is to minimize the standard group 



£i/£ p regularization (without overlap) in the expanded space of latent variables in |2| built by replicating variables 



belonging to more than one group, thus working in a d-dimensional space with d = ££=1 \G r \. Setting * = <JP 
and Rp (v) — J^Li ll u i-|| p / problem |2) can be written as 

1 



mm — 



2rR G p (v). 



The main advantage of such a formulation relies on the possibility of using any state-of-the-art optimization 
procedure for £i/£ p regularization without overlap. In terms of proximal methods, a possible solution is given by 
Algorithm |3j where the proximity operator can be now computed group-wise as 

((P rOX AR S ( w ))j) = (1-TTxs ) ((UjOjgGr) 

for all r = 1, . . . , B, where SG r , P now denotes the £ q unitary ball in R G r . Furthermore for p = 2 and p = +oo 
each projection can be computed exactly as described in Subsection 3.2.1 , and the proximity operator of R p is thus 



exact. The optimization algorithm for solving I GSO-p I via FISTA in the replicated space is reported in Algorithm 

a 

The replicate formulation involves a much simpler proximity operator, but each iteration has higher compu- 
tational cost, since now depends on d rather than on d, and thus increases with the amount of overlap among 
variables subsets (see Section|4]for numerical comparisons between the projection and replication approaches). 

4 Numerical experiments 

In this section we present numerical experiments aimed at studying the computational performance of the pro- 
posed family of optimization algorithms, and at comparing them with the state-of-the-art algorithms applied to 
the replicate formulation. 
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Algorithm 4 FISTA for Group-wise Selection without overlap 

Given: v° e Ilf=i RGr ; T > °> CT = ||* T *HAi 
Initialize: m = 0, w 1 = v°, t 1 = 1 
while convergence not reached do 
for r = 1, . . . ,B do 



end while 
return v m 



4.1 Cyclic Projections vs dual formulation 

We build £? groups,{G r }f =1 , of size b, with G r C {1, . . . , d}, by randomly drawing sets of b indexes from {1, . . . ,d}, 
and consider the cases 6=10, and 6 = 100. We vary the number of groups B, so that the dimension of the expanded 
space is a times the input dimension, d = ad, with a = 1.2, 2 and 5. Clearly this amounts to taking B = a ■ d/b. 
We then generate a vector x £ M. d by randomly drawing each of its entry from Af(0, 1). We then pick a value of r 
such that, when computing prox TnS (x), all groups are active. Precisely we take t = .8 • min r =i j ... ) B ||x|| G 2 . We 

first compute the exact solution x^ = prox f2 s (as) FJ Then we compute the approximated solutions with the Cyclic 
Projections Algorithm [2] and by solving the dual via the projected Newton method. We will refer to the former as 
CP2 and to the latter as dual. We stop the iteration when the distance from the exact solution is less than e the norm 
of xK We consider different values for the tolerance e, precisely we take e = 10~ 2 , 10~ 3 , 10~ 4 . 

Mean and standard deviation of the computing time over 20 repetitions are plotted in Figure [T] and [2] for each 
value of a and e. The dual formulation is faster than the Cyclic Projections algorithm in most situations. It is 
convenient to use Cyclic Projections when the number of active groups is high and the required tolerance very 
low. When computing the projection for Algorithm [TJ it is thus reasonable to use Cyclic Projections in the very 
first outer iterations, when the tolerance - which depends on the outer iterations - is low, and the solution could 
be not sparse, because still far from convergence. After few iterations, it is more convenient to resort to the dual 
formulation. Even though, not optimal, in the following experiments, when denoting GSO-2 via projection we 
will consider always the projection computed with the dual formulation. 

4.2 Projection vs replication 

In this Subsection we compare the running time performance of the proposed set of algorithms where the proxim- 
ity operator is computed approximately, to state-of-the-art algorithms used to solve the equivalent formulation in 
the replicated space. For such a comparison we restrict to p = 2, since many benchmark algorithms are available 
in the case of groups that do not overlap. In order to ensure a fair comparison, we first run some preliminary 
experiments to identify the fastest codes for group t\ regularization with no overlap. 

it is the solution computed via the projected Newton method for the dual problem with very tight tolerance 




end for 
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Figure 1: Computing time (in seconds) necessary for evaluating the prox vs number of variables (d), for different 
values of the overlap degree a and the tolerance, for fixed group size b = 10. 
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Figure 2: Computing time (in seconds) necessary for evaluating the prox vs number of variables (d), for different 
values of the overlap degree a and the tolerance, for fixed group size b = 100 
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4.2.1 Comparison without overlap 

Recently there has been a very active research on this topic, see e.g. ll39ll40lH2ll . For the comparison, we consid- 
ered three algorithms which are representative of the optimization techniques used to solve group lasso: interior- 
point methods, (group) coordinate descent and its variations, and proximal methods. As an instance of the 
first set of techniques we employed the publicly available Matlab code at Ihttp : / /www . di . ens . f r/ ~ fbach/| 
|grouplasso/index . htm| described in JTJ . For coordinate descent methods, we employed the R-package grlplasso, 
which implements block coordinate gradient descent minimization for a set of possible loss functions. In the fol- 
lowing we will refer to these two algorithms as "IP" and "BCGD". Finally, as an instance of proximal methods, 
we use our Matlab implementation of FISTA for Group-wise Selection, namely Algorithm [4] with FISTA instead of 
ISTA as updating rule. We will refer to it as "PROX". 

We first observe that the solutions of the three algorithms coincide up to an error which depends on each 
algorithm tolerance. We thus need to tune the each tolerance in order to guarantee that all iterative algorithms 
are stopped when the level of approximation to the true solution is the same. Toward this end, we run Algorithm 
PROX with machine precision, v = 10~ 16 , in order to have a good approximation of the asymptotic solution. 
We observe that for many values of n and d, and over a large range of values of r, the approximation of PROX 
when v = 10~ 6 is of the same order of the approximation of IP with optparam . tol = 10~ 9 , and of BCGD with 
tol = 1CP 12 . Note also that with these tolerances the three solutions coincide also in terms of selection, i.e. their 
supports are identical for each value of r. Therefore the following results correspond to optparam . tol = 10~ 9 
for IP, tol = 10~ 12 for BCGD, and v = 10~ 6 for PROX. For the other parameters of IP we used the values used in 
the demos supplied with the code. 

Concerning the data generation protocol, the input variables x = (xi, . . . , Xd) are uniformly drawn from [—1, l] d . 
The labels y are computed using a noise-corrupted linear regression function, i.e. y = x-x + w, where x depends on 
the first 30 variables, Xj = cif j = 1, . . . , 30, and otherwise, w is an additive noise, w~N(0, 1), and c is a rescaling 
factor that sets the signal to noise ratio to 5:1. In this case the dictionary coincides with the variables, ^j(x) = Xj 
for j = 1, . . . , d. We then evaluate the entire regularization path for the three algorithms with B sequential groups 
of 10 variables, (Gi=[l, . . . , 10], G ? 2=[H ) ■ ■ ■ , 20], and so on), for different values of n and B. In order to make sure 
that we are working on the correct range of values for the parameter r, we first evaluate the set of solutions of 
PROX corresponding to a large range of 500 values for r, with v = 10~ 4 . We then determine the smallest value of t 
which corresponds to selecting less than n variables, T min , and the smallest one returning the null solution, r max . 
Finally we build the geometric series of 50 values between T m i n and r max , and use it to evaluate the regularization 
path on the three algorithms. In order to obtain robust estimates of the running times, we repeat 20 times for each 
pair n,B. 

In Table [T] we report the computational times required to evaluate the entire regularization path for the three 
algorithms. Algorithms BCGD and PROX are always faster than IP which, due to memory reasons, cannot be 
applied to problems where the number of variables are more than 5000, since it requires to store the d x d matrix 
"fx*. It must be said that the code for GP-IL was made available mainly in order to allow reproducibility of 
the results presented in [1 J, and is not optimized in terms of time and memory occupation. However it is well 
known that standard second-order methods are typically precluded on large data sets, since they need to solve 
large systems of linear equations to compute the Newton steps. PROX is the fastest for B = 10, 100 and has a 
similar behavior to BCGD. The candidates as benchmark algorithms for comparison with FISTA via projection are 
therefore BCGD and PROX. Since we are more familiar with the PROX algorithm, we therefore compare FISTA via 
projection with the PROX algorithm, i.e. FISTA via replication only. 

4.2.2 Comparison with overlap 

Here we compare two different implementations of the GSO-2 solution: FISTA via approximated projection com- 
puted by solving the dual problem with projected Newton method, and FISTA via replication. We will refer to the 
former as FISTA-proj, and to the latter as FISTA-repl. 

The data generation protocol is equal to the one described in the previous experiments, but x depends on the 
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Table 1: Running time (mean and standard deviation) in seconds for computing the entire regularization path of 
IP, BCGD, and PROX for different values of B, and n. 



n = 100 



n = 500 



n = 1000 





B = 10 


B = 100 


TP 


r. a _i_ n a 
o . u zn u. u 


uu zn yu 




^ . _l zn u. u 


^.o zn u.u 


PROX 


0.21 ±0.04 


2.9 ±0.4 




B = 10 


B = 100 


IP 


2.30 ±0.27 


370 ± 30 


BCGD 


2.15 ±0.16 


4.7 ±0.5 


PROX 


0.1514±0.0025 


2.54 ±0.16 




S = 10 


B = 100 


IP 


1.92 ±0.25 


328 ± 22 


BCGD 


2.06 ±0.26 


18 ±3 


PROX 


0.182 ±0.006 


4.7 ±0.5 



first 12/56 variables (which correspond to the first three groups) 

x = ( c, c , 0, 0, „ . , ). 

fc-12/5 times d-6-12/5 times 

We then define £? groups of size 6, so that d — B ■ b > d. The first three groups correspond to the subset of relevant 
variables, and are defined as G 1 = [1, . . . , b], G 2 = [4/56 + 1, . . . , 9/56], and G 3 = [1, . . . , 6/5, 8/56 + 1, . . . , 12/56], 
so that they have a 20% pair-wise overlap. The remaining B — 3 groups are built by randomly drawing sets 
of 6 indexes from {1, d}. In the following we will let n = 10|Gi UG 2 U G 3 |, i.e. n is ten times the number of 
relevant variables, and vary d, b. We also vary the number of groups B, so that the dimension of the space of latent 
variables is a times the input dimension, d = ad, with a = 1.2,2,5. Clearly this amounts to taking B = a-d/b. The 
parameter a can be thought of as the average number of groups a single variable belongs to. We identify the correct 
range of values for r as in the previous experiments, using FISTA-proj with loose tolerance, and then evaluate the 
running time and the number of iterations necessary to compute the entire regularization path for FISTA-repl on 
the expanded space and FISTA-proj, both with v = 10~ 6 . Finally we repeat 20 times for each combination of the 
three parameters d, 6, and a. 

Table 2: Running time (mean ± standard deviation) in seconds for 6= 10 (top), and 6 = 100 (below). For each d and 
a, the left and right side correspond to FISTA-proj, and FISTA-repl, respectively. 







a = 


1.2 


a = 


= 2 


a 


= 5 


d=1000 


0.15 ±0.04 


0.20 ±0.09 


1.6 ±0.9 


5.1 ±2.0 


12.4 ± 1.3 


68 ±8 


d=5000 




1.1 ±0.4 


1.0 ±0.6 


1.55 ±0.29 


2.4 ±0.7 


103 ± 12 


790 ± 57 


d=10000 




2.1 ±0.7 


2.1 ± 1.4 


3.0 ±0.6 


4.5 ± 1.4 


460 ± 110 


2900 ± 400 






a = 


1.2 


a — 


2 


a = 


= 5 


rf=1000 




11.7 ±0.4 


24.1 ±2.5 


11.6 ±0.4 


42 ±4 


13.5 ±0.7 


1467 ± 13 


d=5000 




31 ±13 


38 ± 15 


90 ±5 


335 ±21 


85 ±3 


1110 ±80 


d =10000 


16.6 ±2.1 


13 ±3 


90 ±30 


270 ± 120 


296 ± 16 
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Table 3: Number of iterations (mean ± standard deviation) for b = 10 (top) and b = 100 (below). For each d and a, 
the left and right side correspond to FISTA-proj, and FISTA-repl, respectively. 





a = 


1.2 




a 


= 2 




a — 


5 


d= 


=1000 


100 ± 30 


80 ±30 


1200 ± 500 


1900 ± 800 


2150 ± 160 


11000 ± 1300 


d-- 


=5000 


100 ± 40 


70 ±30 


148 ± 25 


139 ± 24 


6600 ± 500 


27000 ± 2000 


d-- 


=10000 


100 ± 30 


70 ±40 


160 ± 30 


137 ± 26 


13300 ± 1900 


49000 ± 6000 






a 


= 1.2 






a = 2 




a 


= 5 


d 


=1000 


913 ± 12 


2160 ±210 


894 ± 11 2700 ±300 


895 ± 10 


4200 ± 400 


d 


=5000 


600 ± 400 


600 ± 300 


1860 ± 110 4590 ±290 


1320 ± 30 


6800 ± 500 


d 


=10000 


81 ±11 


63 ± 11 


1000 ± 500 1800 ± 900 


2100 ±60 





Running times and number of iterations are reported in Table [2] and |3j respectively. When the overlap, that is 
a, is low the computational times of FISTA-repl and FISTA-proj are comparable. As a increases, there is a clear 
advantage in using FISTA-proj instead of FISTA-repl. The same behavior occurs for the number of iterations. 



4.3 p = 2 vs p = oo 



We generate the groups and the coefficient vector as in Subsection 4.1 with b = 10. Differently from the Sub- 
section 



4.1 



here we compare the computational performance of the same algorithm applied to two different 
problems: Cyclic Projections for p = 2 and Cyclic Projections for p — oo, that yield different solutions, since 
prox TQ g(x) prox Tf2 g (x). In order to guarantee a fair comparison we consider two different values of r, 
and r^, such that, when computing prox^s (x) and prox r nS (x), all groups are active. Precisely we take 

T2 = .8 • min I . = i.... ! B \\%\\g 2- T =o = -8 ■ min r= i r . b \\x\\ g . We compute the approximated solutions with 
the Cyclic Projections Algorithm [2] for p = 2 and p — oo. We will refer to the former as CP2 and to the latter 
as CPinf . We stop the iteration when the relative decrease of the approximated solution is below e. We consider 
different values for the tolerance e, precisely we take e = 10~ 2 , 10~ 3 , 10~ 4 . 

For each value of a and e we estimate the number of iterations, and the computing time for the two algorithms, 
and average over 20 repetitions. Mean and standard deviation of number of iterations and the computing time are 
plotted in Figure [3] and [4] In all conditions CP2 is much faster than CPinf. 



4.4 Real data Experiments: Microarray data 

In the previous subsection we have shown that, thanks to the computational efficiency of the proposed projection 
algorithm, the GSO-p regularization scheme can be easily applied to large data sets with large group overlap. 
Here we show that on real data, indeed, dealing with the entire data set without resorting to preprocessing leads 
to improved prediction and selection performance. We consider the microarray experiment presented in [21 1 
where the breast cancer dataset compiled by |49| (8141 genes for 295 tumors) is analyzed with the group lasso 
with overlap penalty and the 637 gene groups corresponding to the MSigDB pathways |46|. In [21 J the accuracy 
of a logistic regression is estimated via 3-fold cross validation. On each split the 300 genes most correlated with 
the output are selected and the optimal r is chosen via cross validation. 6, 5 and 78 pathways are selected with 
a 0.36 ± 0.03 cross validation error. We applied FISTA-proj to the entire data set with two loops of k-fold cross 
validation (k= 3 for testing). The obtained cross validation error is 0.33 ± 0.05 and 0.30 ± 0.06, with k= 3 and 
k= 10 for validation, respectively. In both cases the number of selected groups is 2,3, and 4, with 1 group in 3, 
and 3 pathways selected in 2 out of 3 splits. The computing time for running the entire framework for FISTA- 
proj (comprising data and pathways loading, recentering, selection via FISTA-proj, regression via RLS on the 
selected genes, and testing) is 850s (k=3) and 3387s (k=10). Note that, while the improved cross validation error 
might be due to the second optimization step (RLS), the improved stability is probably due to the absence of the 
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Figure 3: Number of iteration necessary for evaluating the prox vs number of variables (d), for different values of 
the overlap degree a, and the tolerance. 




Figure 4: Computing time (in seconds) necessary for evaluating the prox vs number of variables (d), for different 
values of the overlap degree a, and the tolerance 
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preprocessing step, which can be highly unstable, thus compromising the overall stability of the solution. 

5 Discussion 

We have presented an efficient optimization procedure for computing the solution of a set of regularization 
schemes that perform group-wise selection with overlapping groups, whose convergence is guaranteed. Our pro- 
cedure allows dealing with high dimensional problems with large group overlap. We have empirically shown that 
it has a significant computational advantage with respect to state-of-the-art algorithms for group-wise selection 
applied on the expanded space built by replicating variables belonging to more than one group. We also mention 
that computational performance may improve if our scheme is used as core for the optimization step of active set 
methods, such as |44|. Finally, the improved computational performance enables to use group-wise selection with 
overlap for pathway analysis of high-throughput biomedical data, since it can be applied to the entire data set and 
using all the information present in online databases, without pre-processing for dimensionality reduction. 
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A Projected Newton Method 

In this appendix we report as Algorithm [5] Bertsekas' projected Newton method described in 0, with the modif- 
cations needed to perform the maximization of a concave function instead of the minimization of a convex one. 

Algorithm 5 Projection onto iff 

Given: x G R d ,Xm e R § ,r) e (0, 1), S G (0,l/2),e>0 
Initialize: I = 0, A = A Mt 

while (d r f(X l ) ^ if A,. > 0, or d r f(X l ) > if A r = 0, for some r = 1, ...,£?) do 
l:=l + l 

et=mm{e,\\\ l -[\ l +Vf(\ l )]+\\} 
I'+ = {r : < X l r < e h d r f(X l ) < 0} 

H l = |0 i f r ^ s, and r E l l + or s E Z\ ^ 

r ' s \d r d s f(X l ) otherwise 

X(a) = [X l -a(H l )- 1 S7f(X l )} + 

m = 

while /(A(rT))-/(A0 < S E r ^' + Ef=i d r f(X l )[(H l )-\ s d s f(X l ) + J2 reXl+ d r f(X l )[K(v m ) - K-}} do 

m : = m + 1 
end while 

X l+1 = A ^w.) 

end while 
return A' +1 

The step size rule, i.e. the choice of a, is a combination of the Armijo-like rule Il43l and the Armijo rule usually 
employed in unconstrained minimization (see, e.g., 1381 ). 
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